Abstract

We study a problem of model selection for data produced by two dif- 
ferent context tree sources. Motivated by linguistic questions, we consider 
the case where the probabilistic context trees corresponding to the two 
sources are finite and share many of their contexts. In order to understand 
the differences between the two sources, it is important to identify which 
contexts and which transition probabilities are specific to each source. We 
consider a class of probabilistic context tree models with three types of 
contexts: those which appear in one, the other, or both sources. We use a 
BIC penalized maximum likelihood procedure that jointly estimates the 
two sources. We propose a new algorithm which efficiently computes the 
estimated context trees. We prove that the procedure is strongly consis- 
tent. We also present a simulation study showing the practical advantage 
of our procedure over a procedure that works separately on each dataset. 

Key words: BIC, context tree models, joint estimation, penalized
maximum likelihood, variable length Markov chains.



1 Introduction 



We assign probabilistic context tree models to data produced by two
different sources on the same finite alphabet A. Probabilistic context tree
models were first introduced in Rissanen (1983) as a flexible and
parsimonious model for data compression. Originally called by Rissanen finite
memory source or probabilistic tree, this class of models recently became
popular in the statistics literature under the name of Variable Length
Markov Chains (VLMC) (Bühlmann & Wyner, 1999). The idea behind
the notion of variable memory models is that, given the whole past, the 
conditional distribution of each symbol only depends on a finite part of 
the past and the length of this relevant portion is a function of the past it- 
self. Following Rissanen we call context the minimal relevant part of each 
past. The set of all contexts satisfies the suffix property which means that 
no context is a proper suffix of another context. This property allows us 
to represent the set of all contexts as a rooted labeled tree, by reading 
the contexts' symbols from the nodes to the root. With this representa- 
tion, the process is described by the tree of all contexts, called context 
tree, together with a family of probability measures on A indexed by the 
contexts. In this work we shall only consider finite context trees. The 
probability distribution of a context gives the transition probability to 
the next symbol from any past having this context as a suffix. From now 
on, the pair composed by the context tree and its family of probability 
measures will be called probabilistic context tree. 

The issue we consider here was suggested by a linguistic case study
presented in Galves et al. (2009). This paper addresses the problem of
characterizing rhythmic patterns displayed by two variants of Portuguese: 
Brazilian and European. This is done by considering two data sets consist- 
ing of encoded newspaper texts in two languages. Each data set was anal- 
ysed separately using a penalized maximum likelihood procedure which 
selected two different probabilistic context trees corresponding to the two 
variants of Portuguese. A striking feature emerging from this analysis 
is the fact that most of the contexts and corresponding transition prob- 
abilities are common to the two dialects of Portuguese. Obviously the 
discriminant features characterizing the different rhythms implemented 
by the two dialects are expressed by the contexts which appear in one but 
not in the other model. 

To identify those discriminant contexts, the first idea is to estimate
separately the context tree for each set of observations, using some
classical context tree estimator like the algorithm Context (Rissanen, 1983) or
a penalized maximum likelihood procedure as in Csiszár & Talata (2006)
(see also Garivier & Leonardi (2011)), and then compare the obtained
trees. This is precisely what is done in Galves et al. (2009). However,
such an approach does not use the information that the two sources share
some identical contexts and probability distributions. We propose in this
paper a selection method using penalized maximum likelihood for the
whole set of observations.

In this paper, we argue that a joint model selection more efficiently 
identifies the relevant features and estimates the parameters. The joint 
estimation of the two probabilistic context trees is accomplished by a
penalized maximum likelihood criterion. Namely, we distinguish two types
of contexts: those which appear in both sources with the same probability
distribution (we call them shared contexts), and the others. The latter
appear either in only one of the two sources, or in both sources
but with different associated probability distributions.

At first sight the huge number of models in the class suggests that
such a procedure is intractable. Actually this is not the case. We show
that the Context Tree Maximizing procedure, which has been described in
Willems et al. (1995), can be adapted to recursively find the maximizer:
we propose a new algorithm to efficiently compute the estimated context
trees. We prove the strong consistency of the procedure. Our proof
is inspired by some arguments given in Csiszár & Talata (2006), which
handles the case of a single (but possibly infinite) context tree source
estimation; as in Garivier (2006), the size of the trees is not bounded in the
maximization procedure. We also present a simulation study showing the 
significant advantage of our procedure, for the estimation of the shared 
contexts, over a procedure that works separately on each dataset. 

The paper is organized as follows. In Section 2 we present the joint
context tree estimation problem and the notation. Section 3 is devoted to
the presentation of the penalized maximum likelihood estimator we study
in this paper. For an appropriate choice of the penalty function, a strong
consistency result is given. We describe in Section 4 how to efficiently
compute the joint estimator. This is a challenging task, as the number
of possible models grows exponentially with the sample size. We show
how to take advantage of the recursive tree structure to build an efficient
greedy algorithm. The value of this estimator is experimentally shown in
Section 5 through a simulation study. The proof of the consistency result
is given in Appendix B. It relies on a technical result on the
Krichevsky-Trofimov distribution that is given in Appendix A.

2 Notation

Let $A$ be a finite alphabet, and $A^* = \cup_{n\in\mathbb{N}} A^n$ the set of all possible
strings including the empty string $\epsilon$. Denote also by $A^+ = \cup_{n\geq 1} A^n$ the
set of non-empty strings. A string $s \in A^+$ has length $|s| = n$ if $s \in A^n$,
and we note $s = s_{1:|s|}$. The empty string has length 0. The concatenation
of strings $s$ and $s'$ is denoted by $ss'$. The string $s'$ is a suffix of $s$ if there
exists a string $u$ such that $s = us'$; it is a proper suffix if $u \neq \epsilon$.

A tree $\tau$ is a non-empty subset of $A^*$ such that no $s_1 \in \tau$ is a suffix of
any other $s_2 \in \tau$. The depth of a finite tree $\tau$ is defined as

$$D(\tau) = \max\{|s| : s \in \tau\}.$$

A tree is complete if each node except the leaves has exactly $|A|$ children
(here $|A|$ denotes the number of elements in $A$). Note that $\{\epsilon\}$ is a
complete tree.

Let $\mathcal{P}_A$ be the $(|A| - 1)$-dimensional simplex, that is the subset of
vectors $p = (p_a)_{a\in A}$ in $\mathbb{R}^{|A|}$ such that $p_a \geq 0$, $a \in A$ and $\sum_{a\in A} p_a = 1$.
To define a stationary context tree source, we need a complete tree $\tau$ and
a parameter $\theta \in \mathcal{P}_A^{\tau}$, that is $\theta = (\theta(s))_{s\in\tau}$ where, for any $s \in \tau$, $\theta(s) \in \mathcal{P}_A$.
The $A$-valued stochastic process $Z = (Z_n)_{n\in\mathbb{Z}}$ is said to be a stationary
context-tree source (or variable length Markov chain) with distribution
$\mathbb{P}_{\tau,\theta}$ if for any semi-infinite sequence denoted by $z_{-\infty:0}$, there exists one
(and only one) $s \in \tau$ such that $s$ is a suffix of $z_{-\infty:-1}$, and such that, for
any $n \geq |s|$, if the event $\{Z_{-n:-1} = z_{-n:-1}\}$ has positive probability, the
conditional distribution of $Z_0$ given $\{Z_{-n:-1} = z_{-n:-1}\}$ is $\theta(s)$ and thus
depends only on $z_{-|s|:-1}$. Following Rissanen, an element of $\tau$ is called a
context. In the case when $\tau = \{\epsilon\}$, the source is called memoryless.
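
The definitions above can be illustrated by a minimal sketch. It assumes that strings are written in time order (the last character is the most recent symbol) and that a tree is stored as a Python set of strings; the helper names `find_context` and `is_tree` are hypothetical and are not part of the paper.

```python
def find_context(tree, past):
    """Return the element of `tree` that is a suffix of `past`, or None.

    Strings are written in time order, so the last character of `past`
    is the most recent symbol. Suffixes are tried from shortest to longest.
    """
    for k in range(len(past) + 1):
        suffix = past[len(past) - k:]
        if suffix in tree:
            return suffix
    return None


def is_tree(tree):
    """Check that `tree` is non-empty and no element is a suffix of another."""
    return len(tree) > 0 and not any(
        s1 != s2 and s2.endswith(s1) for s1 in tree for s2 in tree
    )


tree = {"1", "12", "22"}            # a complete tree over the alphabet {1, 2}
print(is_tree(tree))                # True
print(find_context(tree, "2112"))   # 12
print(find_context(tree, "1221"))   # 1
print(find_context(tree, "2"))      # None: the past is too short
```

The last call returns `None` because a past reduced to the single symbol 2 does not yet determine a context; such positions correspond to the indices that are not in any context of the tree.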

For any $s \in \tau$, any integer $n$ and any $z_{1:n} \in A^n$, denote by $S(s; z_{1:n})$
the string with the symbols that appear after an occurrence of $s$ in the
sequence $z_{1:n}$. Formally,

$$S(s; z_{1:n}) = \bigodot_{i \in \{|s|+1,\dots,n\} \,:\, z_{i-|s|:i-1} = s} z_i,$$

where $\bigodot$ denotes the concatenation operator. When $z_{i-|s|:i-1} = s$, we say
that $z_i$ is in context $s$. Besides, denote by $I(z_{1:n}; \tau)$ the set of indices $i$ of
$z_{1:n}$ that are not in context $s$ for any $s \in \tau$:

$$I(z_{1:n}; \tau) = \left\{ i \in \{1, \dots, n\} : \forall s \in \tau,\ z_{(i-|s|)\vee 1:i-1} \neq s \right\}.$$
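Then, if $\mathbb{P}_{\tau,\theta}(Z_{1:n} = z_{1:n}) > 0$,

$$\mathbb{P}_{\tau,\theta}(Z_{1:n} = z_{1:n}) = \prod_{i \in I(z_{1:n};\tau)} \mathbb{P}_{\tau,\theta}(Z_i = z_i \mid Z_{1:i-1} = z_{1:i-1}) \prod_{s\in\tau} \mathbb{P}_{\theta(s)}(S(s; z_{1:n})),$$

where for $\vartheta \in \mathcal{P}_A$, $\mathbb{P}_{\vartheta}$ denotes the probability distribution of the
memoryless source on $A$ with parameter $\vartheta$.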

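Assume that $X = (X_n)_{n\in\mathbb{Z}}$ and $Y = (Y_n)_{n\in\mathbb{Z}}$ are independent
stationary context tree sources. Let us define subsets $\sigma_0$, $\sigma_1$ and $\sigma_2$ of $A^*$,
and parameters $\theta_0 = (\theta_0(s))_{s\in\sigma_0}$, $\theta_1 = (\theta_1(s))_{s\in\sigma_1}$, $\theta_2 = (\theta_2(s))_{s\in\sigma_2}$,
$\theta_i(s) \in \mathcal{P}_A$, $s \in \sigma_i$, $i = 0,1,2$ by the following properties: $X$ has
distribution $\mathbb{P}_{\tau_1,(\theta_0,\theta_1)}$, $Y$ has distribution $\mathbb{P}_{\tau_2,(\theta_0,\theta_2)}$, and

$$\sigma_1 \cap \sigma_0 = \emptyset, \quad \sigma_2 \cap \sigma_0 = \emptyset, \tag{1}$$

$$\tau_1 := \sigma_1 \cup \sigma_0 \text{ is a complete tree}, \tag{2}$$

$$\tau_2 := \sigma_2 \cup \sigma_0 \text{ is a complete tree}, \tag{3}$$

$$\forall s \in \sigma_1 \cap \sigma_2, \quad \theta_1(s) \neq \theta_2(s). \tag{4}$$

$\sigma_0$ is the set of shared contexts, that is the set of contexts which intervene
in both sources with the same associated probability distributions.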

Given two samples $X_{1:n} = (X_1, \dots, X_n)$ and $Y_{1:m} = (Y_1, \dots, Y_m)$
generated by $X$ and $Y$ respectively, the aim of this paper is to propose a
statistical method for the joint estimation of $\sigma_0$, $\sigma_1$ and $\sigma_2$, and
consequently of $\theta_0$, $\theta_1$ and $\theta_2$.

This is a model selection problem, in which the collection of models
is described by the possible $\sigma_0$, $\sigma_1$ and $\sigma_2$, and for fixed $\sigma_0$, $\sigma_1$ and $\sigma_2$ the
model consists of all $\mathbb{P}_{\sigma_1\cup\sigma_0,(\theta_0,\theta_1)}$ and $\mathbb{P}_{\sigma_2\cup\sigma_0,(\theta_0,\theta_2)}$ for any possible $\theta_i$,
$i = 0,1,2$.

We propose in the next section a selection method using penalized
maximum likelihood for the entire set of observations.

3 The joint Context Tree Estimator 
3.1 Likelihood in context-tree models 

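For any $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2) and (3), define $\mathcal{M}_{(\sigma_0,\sigma_1,\sigma_2)}$ the
set of distributions $\mathbb{Q}$ on $A^{\mathbb{Z}} \times A^{\mathbb{Z}}$ of form

$$\mathbb{Q} = \mathbb{P}_{\sigma_1\cup\sigma_0,(\theta_0,\theta_1)} \otimes \mathbb{P}_{\sigma_2\cup\sigma_0,(\theta_0,\theta_2)} := \mathbb{Q}_X \otimes \mathbb{Q}_Y$$

for some $\theta_0 = (\theta_0(s))_{s\in\sigma_0}$, $\theta_1 = (\theta_1(s))_{s\in\sigma_1}$, $\theta_2 = (\theta_2(s))_{s\in\sigma_2}$, such that
$\theta_i(s) \in \mathcal{P}_A$, $s \in \sigma_i$, $i = 0,1,2$. Here we do not assume (4).
For any integers $n$ and $m$, any $x_{1:n} \in A^n$ and $y_{1:m} \in A^m$ and any string $s$,
denote by $S(s; x_{1:n}; y_{1:m}) = S(s; x_{1:n})S(s; y_{1:m})$ the concatenation of the
$x_i$'s in context $s$, and of the $y_i$'s in context $s$. One has:

$$\begin{aligned}
\mathbb{Q}(X_{1:n} = x_{1:n}, Y_{1:m} = y_{1:m}) = {} & \prod_{i\in I(x_{1:n};\sigma_1\cup\sigma_0)} \mathbb{P}_{\sigma_1\cup\sigma_0,(\theta_0,\theta_1)}(X_i = x_i \mid X_{1:i-1} = x_{1:i-1}) \\
& \times \prod_{i\in I(y_{1:m};\sigma_2\cup\sigma_0)} \mathbb{P}_{\sigma_2\cup\sigma_0,(\theta_0,\theta_2)}(Y_i = y_i \mid Y_{1:i-1} = y_{1:i-1}) \\
& \times \prod_{s\in\sigma_0} \mathbb{P}_{\theta_0(s)}(S(s; x_{1:n}; y_{1:m})) \prod_{s\in\sigma_1} \mathbb{P}_{\theta_1(s)}(S(s; x_{1:n})) \prod_{s\in\sigma_2} \mathbb{P}_{\theta_2(s)}(S(s; y_{1:m})).
\end{aligned} \tag{5}$$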
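Let us now note for any $s \in A^*$ and any $a \in A$:

$$N_{n,X}(s,a) = \sum_{i=|s|+1}^{n} 1_{X_{i-|s|:i-1}=s,\ X_i=a}, \qquad N_{n,X}(s) = \sum_{i=|s|+1}^{n} 1_{X_{i-|s|:i-1}=s},$$

where it is understood that an empty sum is 0, and

$$N_{m,Y}(s,a) = \sum_{i=|s|+1}^{m} 1_{Y_{i-|s|:i-1}=s,\ Y_i=a}, \qquad N_{m,Y}(s) = \sum_{i=|s|+1}^{m} 1_{Y_{i-|s|:i-1}=s}.$$

Observe that $N_{n,X}(\epsilon) = n$ and $N_{m,Y}(\epsilon) = m$. Then, when maximizing
over $\mathcal{M}_{(\sigma_0,\sigma_1,\sigma_2)}$ the likelihood as given by (5), we shall use the
approximation that the first two terms may be maximized as free parameters (so
that their maximization gives 1). Thus we shall use the pseudo maximum
log-likelihood

$$\begin{aligned}
\ell_{n,m}(\sigma_0,\sigma_1,\sigma_2) = {} & \sum_{s\in\sigma_1}\sum_{a\in A} N_{n,X}(s,a) \log\frac{N_{n,X}(s,a)}{N_{n,X}(s)} + \sum_{s\in\sigma_2}\sum_{a\in A} N_{m,Y}(s,a) \log\frac{N_{m,Y}(s,a)}{N_{m,Y}(s)} \\
& + \sum_{s\in\sigma_0}\sum_{a\in A} \left[N_{n,X}(s,a) + N_{m,Y}(s,a)\right] \log\frac{N_{n,X}(s,a) + N_{m,Y}(s,a)}{N_{n,X}(s) + N_{m,Y}(s)},
\end{aligned}$$

where by convention for any non negative integer $p$, $0\log\frac{0}{p} = 0$. Here
$\log u$ denotes the logarithm of $u$ in base 2.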
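
As an illustration, the counts and the pseudo maximum log-likelihood can be computed directly from their definitions. The following is a minimal sketch, assuming samples are stored as Python strings over the alphabet and context sets as sets of strings; the function names `counts`, `loglik` and `pseudo_loglik` are hypothetical.

```python
from collections import Counter
from math import log2


def counts(z, s):
    """Counter a -> N(s, a): occurrences of s in z immediately followed by a."""
    k = len(s)
    return Counter(z[i] for i in range(k, len(z)) if z[i - k:i] == s)


def loglik(c):
    """sum_a c[a] * log2(c[a] / total); symbols with zero count contribute 0."""
    total = sum(c.values())
    return sum(v * log2(v / total) for v in c.values())


def pseudo_loglik(x, y, sigma0, sigma1, sigma2):
    """Pseudo maximum log-likelihood for the context sets (sigma0, sigma1, sigma2)."""
    ll = sum(loglik(counts(x, s)) for s in sigma1)
    ll += sum(loglik(counts(y, s)) for s in sigma2)
    # Shared contexts pool the counts of both samples.
    ll += sum(loglik(counts(x, s) + counts(y, s)) for s in sigma0)
    return ll


x, y = "1211", "1212"
print(sum(counts(x, "").values()))           # 4, the sample size n
print(pseudo_loglik(x, y, {"1"}, {"2"}, {"2"}))
```

A `Counter` only stores positive counts, which implements the convention that terms with a zero count vanish.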

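For any string $s$, we shall write $\mathbb{Q}_X(\cdot|s)$ and $\mathbb{Q}_Y(\cdot|s)$ the probability
distributions on $A$ given by: $\forall a \in A$,

$$\mathbb{Q}_X(a|s) = \mathbb{Q}(X_{|s|+1} = a \mid X_{1:|s|} = s),$$

$$\mathbb{Q}_Y(a|s) = \mathbb{Q}(Y_{|s|+1} = a \mid Y_{1:|s|} = s),$$

and $\hat{\mathbb{Q}}_X(\cdot|s)$, $\hat{\mathbb{Q}}_Y(\cdot|s)$ and $\hat{\mathbb{Q}}_{XY}(\cdot|s)$ the probability distributions on $A$
given by: $\forall a \in A$,

$$\hat{\mathbb{Q}}_X(a|s) = \frac{N_{n,X}(s,a)}{N_{n,X}(s)}, \qquad \hat{\mathbb{Q}}_Y(a|s) = \frac{N_{m,Y}(s,a)}{N_{m,Y}(s)},$$

$$\hat{\mathbb{Q}}_{XY}(a|s) = \frac{N_{n,X}(s,a) + N_{m,Y}(s,a)}{N_{n,X}(s) + N_{m,Y}(s)},$$

whenever $N_{n,X}(s) > 0$, $N_{m,Y}(s) > 0$ and $N_{n,X}(s) + N_{m,Y}(s) > 0$
respectively. In the same way, with some abuse of notation, we note $\mathbb{Q}_X$ and
$\mathbb{Q}_Y$ any $|s|$-marginal probability distributions on $A^{|s|}$ defined respectively
by $\mathbb{Q}_X$ and $\mathbb{Q}_Y$.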



3.2 Definition of the joint estimator 

Let $\mathrm{pen}(\cdot)$ be a function from $\mathbb{N}$ to $\mathbb{R}$, which will be called penalty
function, and define the estimators $\hat\sigma_0$, $\hat\sigma_1$ and $\hat\sigma_2$ as a triple of maximizers
of

$$C_{n,m}(\sigma_0,\sigma_1,\sigma_2) = \ell_{n,m}(\sigma_0,\sigma_1,\sigma_2) - \frac{|A|-1}{2}\left(|\sigma_0|\,\mathrm{pen}(n+m) + |\sigma_1|\,\mathrm{pen}(n) + |\sigma_2|\,\mathrm{pen}(m)\right)$$

over all possible $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2) and (3). The BIC estimator
corresponds to the choice $\mathrm{pen}(\cdot) = \log(\cdot)$. Notice that it is enough to
restrict the maximum over sets $\sigma_0,\sigma_1,\sigma_2$ that have strings $s$ with length
$|s| \leq n \vee m - 1$. Indeed, if a string $s$ has length $|s| \geq n$, then for any
$a \in A$, $N_{n,X}(s,a) = 0$; if $s$ has length $|s| \geq m$, then for any $a \in A$,
$N_{m,Y}(s,a) = 0$.
For any integer $D$, denote

$$(\hat\sigma_{D,0},\hat\sigma_{D,1},\hat\sigma_{D,2}) = \arg\max C_{n,m}(\sigma_0,\sigma_1,\sigma_2)$$

where the maximization is over all $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2) and (3)
and such that for any $s \in \sigma_0 \cup \sigma_1 \cup \sigma_2$, $|s| \leq D$. Then, as explained before,
the joint estimator $(\hat\sigma_0,\hat\sigma_1,\hat\sigma_2)$ is seen to be:

$$(\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) = (\hat\sigma_{n\vee m-1,0},\hat\sigma_{n\vee m-1,1},\hat\sigma_{n\vee m-1,2}).$$

3.3 Consistency of the joint estimator 

Now assume that $X$ and $Y$ are independent with distribution

$$\mathbb{Q}^* = \mathbb{P}_{\sigma_1^*\cup\sigma_0^*,(\theta_0^*,\theta_1^*)} \otimes \mathbb{P}_{\sigma_2^*\cup\sigma_0^*,(\theta_0^*,\theta_2^*)},$$

where $\sigma_0^*$, $\sigma_1^*$, $\sigma_2^*$ are finite subsets of $A^*$ satisfying (1), (2) and (3), and
such that (4) holds.

Theorem 1 Assume that n and m go to infinity in such a way that

$$\lim \frac{n}{m} = c, \quad 0 < c < +\infty. \tag{6}$$

Assume moreover that for any integer k,

$$\mathrm{pen}(k) = \log k.$$

Then the joint estimator is consistent, i.e.

$$(\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) = (\sigma_0^*,\sigma_1^*,\sigma_2^*)$$

$\mathbb{Q}^*$-eventually almost surely as n goes to infinity.

We have presented our joint estimator with a generic penalty $\mathrm{pen}(\cdot)$, and
Section 4 describes a procedure for efficiently computing this estimator in
the general case. However, the consistency result only covers the choice of
the BIC penalty (Schwarz, 1978), that is the penalty which is the logarithm
of the number of observations times half the number of free parameters.
The proof of Theorem 1 is given in Appendix B.



4 An efficient algorithm for the joint estimator

In this section, we propose an efficient algorithm for the computation of
the joint estimator with no restriction on the depth of the trees. The
recursive tree structure makes it possible to maximize the penalized
maximum likelihood criterion without considering all possible models (which
are far too numerous). The greedy algorithm we present here can be seen
as a non-trivial extension of the Context Tree Maximization algorithm
that was first presented in Willems et al. (1995), see also Csiszár & Talata
(2006). For each possible node $s$ of the estimated tree, the algorithm first
computes recursively, from the leaves to the root, indices $\chi_s(X_{1:n})$, $\chi_s(Y_{1:m})$
and $\chi_s(X_{1:n}; Y_{1:m})$. In a second step, the estimated tree is constructed
from the root to the leaves according to these indices.
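For any string $s$ let

$$\hat P_s(X_{1:n}) = \prod_{a\in A}\left(\frac{N_{n,X}(s,a)}{N_{n,X}(s)}\right)^{N_{n,X}(s,a)}, \qquad \hat P_s(Y_{1:m}) = \prod_{a\in A}\left(\frac{N_{m,Y}(s,a)}{N_{m,Y}(s)}\right)^{N_{m,Y}(s,a)},$$

and let

$$\hat P_s(X_{1:n}; Y_{1:m}) = \prod_{a\in A}\left(\frac{N_{n,X}(s,a) + N_{m,Y}(s,a)}{N_{n,X}(s) + N_{m,Y}(s)}\right)^{N_{n,X}(s,a)+N_{m,Y}(s,a)},$$

where again it is understood that for any non negative integer $n$, $\left(\frac{0}{n}\right)^0 = 1$.
Notice that, because of possible boundary effects at the junction of the two
samples, $\hat P_s(X_{1:n}; Y_{1:m})$ is not in general equal to $\hat P_s(X_{1:n}Y_{1:m})$.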



Step 1: computation of the indices 

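For any set of strings $\sigma$, we denote by $\sigma s$ the set of strings $us$, $u \in \sigma$:
$\sigma s = \{us : u \in \sigma\}$. Let $\sigma$ be a tree, and let

$$R_{\sigma;s}(X_{1:n}) = \sum_{u\in\sigma s} \log \hat P_u(X_{1:n}) - \frac{|A|-1}{2}|\sigma|\,\mathrm{pen}(n),$$

$$R_{\sigma;s}(Y_{1:m}) = \sum_{u\in\sigma s} \log \hat P_u(Y_{1:m}) - \frac{|A|-1}{2}|\sigma|\,\mathrm{pen}(m),$$

$$R_{\sigma;s}(X_{1:n}; Y_{1:m}) = \sum_{u\in\sigma s} \log \hat P_u(X_{1:n}; Y_{1:m}) - \frac{|A|-1}{2}|\sigma|\,\mathrm{pen}(n+m).$$

Let $D$ be an upper bound on the size of the candidate contexts in $\sigma_0 \cup
\sigma_1 \cup \sigma_2$. Note that it is sufficient to consider $D = n \vee m$ to investigate all
possible trees. Define for any string of length $|s| = D$:

$$V_s(X_{1:n}) = R_{\{\epsilon\};s}(X_{1:n}), \qquad \chi_s(X_{1:n}) = 0,$$

$$V_s(Y_{1:m}) = R_{\{\epsilon\};s}(Y_{1:m}), \qquad \chi_s(Y_{1:m}) = 0,$$

$$V_s(X_{1:n}; Y_{1:m}) = \max\left\{R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m})\ ;\ R_{\{\epsilon\};s}(X_{1:n}) + R_{\{\epsilon\};s}(Y_{1:m})\right\},$$

and

$$\chi_s(X_{1:n}; Y_{1:m}) = \begin{cases} 1, & \text{if } V_s(X_{1:n}; Y_{1:m}) = R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m}), \\ 2, & \text{else.} \end{cases}$$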

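Then compute recursively for all $s$ such that $|s| < D$:

$$V_s(X_{1:n}) = \max\left\{R_{\{\epsilon\};s}(X_{1:n})\ ;\ \sum_{a\in A} V_{as}(X_{1:n})\right\},$$

and

$$\chi_s(X_{1:n}) = \begin{cases} 0, & \text{if } V_s(X_{1:n}) = R_{\{\epsilon\};s}(X_{1:n}), \\ 1, & \text{else,} \end{cases}$$

$$V_s(Y_{1:m}) = \max\left\{R_{\{\epsilon\};s}(Y_{1:m})\ ;\ \sum_{a\in A} V_{as}(Y_{1:m})\right\},$$

$$\chi_s(Y_{1:m}) = \begin{cases} 0, & \text{if } V_s(Y_{1:m}) = R_{\{\epsilon\};s}(Y_{1:m}), \\ 1, & \text{else.} \end{cases}$$

Define also

$$V_s(X_{1:n}; Y_{1:m}) = \max\left\{R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m})\ ;\ V_s(X_{1:n}) + V_s(Y_{1:m})\ ;\ \sum_{a\in A} V_{as}(X_{1:n}; Y_{1:m})\right\}$$

and

$$\chi_s(X_{1:n}; Y_{1:m}) = \begin{cases} 1, & \text{if } V_s(X_{1:n}; Y_{1:m}) = R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m}), \\ 2, & \text{if } V_s(X_{1:n}; Y_{1:m}) = V_s(X_{1:n}) + V_s(Y_{1:m}), \\ 3, & \text{else.} \end{cases}$$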

For any $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2) and (3), define

$$R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m}) = R_{\sigma_1;s}(X_{1:n}) + R_{\sigma_2;s}(Y_{1:m}) + R_{\sigma_0;s}(X_{1:n}; Y_{1:m}).$$

Notice that

$$R_{(\sigma_1,\sigma_2,\emptyset);s}(X_{1:n}; Y_{1:m}) = R_{\sigma_1;s}(X_{1:n}) + R_{\sigma_2;s}(Y_{1:m})$$

and

$$R_{(\emptyset,\emptyset,\sigma_0);s}(X_{1:n}; Y_{1:m}) = R_{\sigma_0;s}(X_{1:n}; Y_{1:m}).$$

Moreover, remark that

• either $\sigma_1$ and $\sigma_2$ are the empty set and $\sigma_0$ is not the empty set,

• or $\sigma_0$ is the empty set and neither $\sigma_1$ nor $\sigma_2$ is the empty set,

• or none of them is the empty set.

Step 2: construction of the estimated trees 

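Once the indicators $\chi_s(X_{1:n})$, $\chi_s(Y_{1:m})$ and $\chi_s(X_{1:n}; Y_{1:m})$ have been
computed, the estimated sets can be computed recursively from the root to
the leaves. Recall that Csiszár & Talata (2006) prove that for any
string $s$ such that $|s| \leq D$:

$$V_s(X_{1:n}) = \max_{\sigma} R_{\sigma;s}(X_{1:n}) \tag{7}$$

and

$$V_s(Y_{1:m}) = \max_{\sigma} R_{\sigma;s}(Y_{1:m}). \tag{8}$$

Call $\hat\sigma_{X_{1:n}}(s)$ (resp. $\hat\sigma_{Y_{1:m}}(s)$) a tree maximizing (7) (resp. (8)). $\hat\sigma_{X_{1:n}}(s)$
and $\hat\sigma_{Y_{1:m}}(s)$ can be computed recursively as follows: start with the
strings $s$ of length $D$;

• if $\chi_s(X_{1:n}) = 0$, then $\hat\sigma_{X_{1:n}}(s) = \{\epsilon\}$,

• if $\chi_s(X_{1:n}) = 1$, then $\hat\sigma_{X_{1:n}}(s) = \cup_{a\in A}\,\hat\sigma_{X_{1:n}}(as)\,a$,

• if $\chi_s(Y_{1:m}) = 0$, then $\hat\sigma_{Y_{1:m}}(s) = \{\epsilon\}$,

• if $\chi_s(Y_{1:m}) = 1$, then $\hat\sigma_{Y_{1:m}}(s) = \cup_{a\in A}\,\hat\sigma_{Y_{1:m}}(as)\,a$.

Namely, for any string $s$ such that $|s| \leq D$, define $\hat\sigma_1(s)$, $\hat\sigma_2(s)$ and
$\hat\sigma_0(s)$ as:

• if $\chi_s(X_{1:n}; Y_{1:m}) = 1$, then $\hat\sigma_1(s) = \hat\sigma_2(s) = \emptyset$ and $\hat\sigma_0(s) = \{\epsilon\}$,

• if $\chi_s(X_{1:n}; Y_{1:m}) = 2$, then $\hat\sigma_1(s) = \hat\sigma_{X_{1:n}}(s)$, $\hat\sigma_2(s) = \hat\sigma_{Y_{1:m}}(s)$ and
$\hat\sigma_0(s) = \emptyset$,

• if $\chi_s(X_{1:n}; Y_{1:m}) = 3$, then $\hat\sigma_1(s) = \cup_{a\in A}\,\hat\sigma_1(as)\,a$, $\hat\sigma_2(s) = \cup_{a\in A}\,\hat\sigma_2(as)\,a$
and $\hat\sigma_0(s) = \cup_{a\in A}\,\hat\sigma_0(as)\,a$.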
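
The two steps above can be sketched as a single memoized recursion that returns, for each node $s$, both the value $V_s$ and the corresponding estimated sets. This is only an illustrative sketch under stated assumptions, not the authors' implementation: samples are Python strings, the function name `joint_ctm` and its helpers are hypothetical, the BIC penalty in base 2 is the default, and all $|A|^D$ candidate strings are enumerated naively, so it is only usable for small $D$ (the compact suffix tree approach mentioned in the paper is not attempted here).

```python
from collections import Counter
from functools import lru_cache
from math import log2


def joint_ctm(x, y, alphabet, D, pen=log2):
    """Return (V, sigma0, sigma1, sigma2) for contexts of length at most D."""
    n, m = len(x), len(y)
    half = (len(alphabet) - 1) / 2
    leaf_tree = frozenset({""})

    def counts(z, s):
        k = len(s)
        return Counter(z[i] for i in range(k, len(z)) if z[i - k:i] == s)

    def ll(c):
        total = sum(c.values())
        return sum(v * log2(v / total) for v in c.values())

    def lift(kids, j):
        # Union over a of sigma(as)a: append a to each string kept below node as.
        return frozenset(u + a for a, kid in zip(alphabet, kids) for u in kid[j])

    @lru_cache(maxsize=None)
    def single(z, size, s):
        """(V_s, estimated tree below s) for one sample."""
        best = (ll(counts(z, s)) - half * pen(size), leaf_tree)         # chi = 0
        if len(s) < D:
            kids = [single(z, size, a + s) for a in alphabet]
            split = sum(kid[0] for kid in kids)
            if split > best[0]:
                best = (split, lift(kids, 1))                            # chi = 1
        return best

    @lru_cache(maxsize=None)
    def joint(s):
        """(V_s(X;Y), sigma0(s), sigma1(s), sigma2(s))."""
        shared = ll(counts(x, s) + counts(y, s)) - half * pen(n + m)
        best = (shared, leaf_tree, frozenset(), frozenset())            # chi = 1
        (vx, tx), (vy, ty) = single(x, n, s), single(y, m, s)
        if vx + vy > best[0]:
            best = (vx + vy, frozenset(), tx, ty)                        # chi = 2
        if len(s) < D:
            kids = [joint(a + s) for a in alphabet]
            split = sum(kid[0] for kid in kids)
            if split > best[0]:                                          # chi = 3
                best = (split, lift(kids, 1), lift(kids, 2), lift(kids, 3))
        return best

    return joint("")


# Two identical deterministic samples: both contexts "1" and "2" are shared.
value, s0, s1, s2 = joint_ctm("12" * 200, "12" * 200, "12", D=2)
print(sorted(s0), sorted(s1), sorted(s2))   # ['1', '2'] [] []
```

Ties are broken in the order of the indices (shared leaf first, then separate estimation, then splitting), which matches the definition of $\chi_s$ given in Step 1.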

Validity of the algorithm 

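The next proposition shows that the two-step procedure described above
computes the maximum pseudo-likelihood estimator in the joint model.

Proposition 1 For any string s such that $|s| \leq D$,

$$V_s(X_{1:n}; Y_{1:m}) = \max R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m})$$

where the maximum is over all $(\sigma_0,\sigma_1,\sigma_2)$ that verify (1), (2) and (3) and
such that

$$\forall u \in \sigma_1 \cup \sigma_2 \cup \sigma_0, \quad |u| + |s| \leq D.$$

In particular,

$$\hat\sigma_{D,0} = \hat\sigma_0(\epsilon), \quad \hat\sigma_{D,1} = \hat\sigma_1(\epsilon), \quad \hat\sigma_{D,2} = \hat\sigma_2(\epsilon).$$

Proof:

The proof is by induction. Observe first that

$$V_s(X_{1:n}) + V_s(Y_{1:m}) = \max_{\sigma_1,\sigma_2} R_{(\sigma_1,\sigma_2,\emptyset);s}(X_{1:n}; Y_{1:m}).$$

Now, if $|s| = D$, then either $\sigma_1 = \sigma_2 = \{\epsilon\}$ and $\sigma_0 = \emptyset$, or $\sigma_1 = \sigma_2 = \emptyset$
and $\sigma_0 = \{\epsilon\}$, and we have

$$V_s(X_{1:n}; Y_{1:m}) = \max\left\{R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m})\ ;\ R_{\{\epsilon\};s}(X_{1:n}) + R_{\{\epsilon\};s}(Y_{1:m})\right\} = \max R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m}).$$

Let us now take $|s| < D$ and assume that Proposition 1 is true for all
strings $as$, $a \in A$. The maximum of the $R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m})$ over
all $(\sigma_0,\sigma_1,\sigma_2)$ that verify (1), (2) and (3) and such that $\forall u \in \sigma_1 \cup \sigma_2 \cup
\sigma_0$, $|u| + |s| \leq D$, is reached by a triple $(\sigma_1,\sigma_2,\sigma_0)$ such that:

• either $\sigma_0 = \{\epsilon\}$, in which case $\sigma_1$ and $\sigma_2$ are necessarily empty and

$$R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m}) = R_{(\emptyset,\emptyset,\{\epsilon\});s}(X_{1:n}; Y_{1:m}) = R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m});$$

• or at least one among $\sigma_1$ and $\sigma_2$ is equal to $\{\epsilon\}$: then $\sigma_0 = \emptyset$ and,
according to Csiszár & Talata (2006),

$$R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m}) = R_{\sigma_1;s}(X_{1:n}) + R_{\sigma_2;s}(Y_{1:m}) = V_s(X_{1:n}) + V_s(Y_{1:m});$$

• or $\sigma_1,\sigma_2,\sigma_0$ are all different from $\{\epsilon\}$, and then each $\sigma_i$, $0 \leq i \leq 2$
can be written as $\sigma_i = \cup_{a\in A}\,\sigma_i(a)\,a$; note that it is possible that,
for some $i \in \{0, 1, 2\}$ and some $a \in A$, $\sigma_i(a)$ is empty, or even that
$\sigma_i$ is empty. In any case, for each $a \in A$ it is easily checked that
$\sigma_1(a)$, $\sigma_2(a)$ and $\sigma_0(a)$ satisfy (1), (2) and (3). Thus

$$\begin{aligned}
R_{(\sigma_1,\sigma_2,\sigma_0);s}(X_{1:n}; Y_{1:m}) &= \sum_{a\in A} R_{(\sigma_1(a),\sigma_2(a),\sigma_0(a));as}(X_{1:n}; Y_{1:m}) \\
&= \sum_{a\in A} \max_{\sigma_1,\sigma_2,\sigma_0} R_{(\sigma_1,\sigma_2,\sigma_0);as}(X_{1:n}; Y_{1:m}) \\
&= \sum_{a\in A} V_{as}(X_{1:n}; Y_{1:m})
\end{aligned}$$

by the induction hypothesis.
To conclude the proof, it is enough to be reminded that, by definition,

$$V_s(X_{1:n}; Y_{1:m}) = \max\left\{R_{\{\epsilon\};s}(X_{1:n}; Y_{1:m})\ ;\ V_s(X_{1:n}) + V_s(Y_{1:m})\ ;\ \sum_{a\in A} V_{as}(X_{1:n}; Y_{1:m})\right\}.$$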

The computational complexity of this procedure is proportional
to the number of candidate nodes $s$, which is equal to the number of
distinct substrings (contiguous subsequences) of $X_{1:n}$ and $Y_{1:m}$, and hence
quadratic in $n$ and $m$. However, if necessary, it is possible to obtain a linear
complexity algorithm by using compact suffix trees, as explained in Garivier (2006).



5 Simulation study 

In this section, we experimentally show the value of joint estimation when
the two sources $X$ and $Y$ share some contexts. We compare the results
obtained by the BIC joint estimator described above with the following
direct approach. First, we estimate $\tau_X$ using the standard BIC tree estimate
$\hat\tau_X = \hat\tau_X(X_{1:n})$, and we independently estimate $\tau_Y$ using $\hat\tau_Y = \hat\tau_Y(Y_{1:m})$.
Then, for all contexts $s$ that are present both in $\hat\tau_X$ and in $\hat\tau_Y$, we compute
the chi-squared distance of the conditional empirical distributions: if this
distance is smaller than a given threshold, we decide that $s$ is a shared
context. The value of the threshold was chosen in order to maximize the
frequency of correct estimation.

5.1 A particularly favorable example 

First consider the following case: 

• $X$ and $Y$ are $\{1,2\}$-valued context-tree sources;

• $\mathbb{Q}_X$ is defined by the conditional distributions $\mathbb{Q}_X(X_0 = 1 | X_{-1} = 1) = 1/3$,
$\mathbb{Q}_X(X_0 = 1 | X_{-2:-1} = 12) = 1/3$, $\mathbb{Q}_X(X_0 = 1 | X_{-2:-1} = 22) = 2/3$;

• $\mathbb{Q}_Y$ is defined by the conditional distributions $\mathbb{Q}_Y(Y_0 = 1 | Y_{-1} = 1) = 3/4$,
$\mathbb{Q}_Y(Y_0 = 1 | Y_{-2:-1} = 12) = 1/3$, $\mathbb{Q}_Y(Y_0 = 1 | Y_{-2:-1} = 22) = 2/3$;

• the estimates are computed from $X_{1:n}$ and $Y_{1:m}$ with $n = 500$ and
$m = 1000$;

• the probability of correctly identifying the tree by each method is
estimated by a Monte-Carlo procedure with 1000 replications (margin
of error ≈ 1.5%).

In that example, we hence have $\sigma_0 = \{12, 22\}$, $\sigma_1 = \{1\}$ and $\sigma_2 = \{1\}$.
We compare our joint estimation procedure with separate estimation using
the following criteria:

• the probability of correctly identifying $\tau_X$ (resp. $\tau_Y$);

• the probability of correctly identifying simultaneously $\tau_X$ and $\tau_Y$;

• the probability of correctly identifying $\sigma_0,\sigma_1,\sigma_2$;

• the Kullback-Leibler divergence rates $KL(\mathbb{Q}_Z|\hat{\mathbb{Q}}_Z)$ between the
stationary processes $\mathbb{Q}_Z$ and $\hat{\mathbb{Q}}_Z$ for $Z \in \{X, Y\}$, which are computed
by using the fact that both $X$ and $Y$ are Markov chains of finite
order.

The results are summarized in Figure 1. It appears that the joint
estimation approach has a significant advantage over separate estimation
on all the criteria considered here, with one restriction: in some cases,
the estimation of either $\tau_X$ or $\tau_Y$ can be deteriorated, while the other is
(more significantly) improved. In all cases, the probability of correctly
estimating both $\tau_X$ and $\tau_Y$ at the same time is increased.

Figure 1: Comparative performance of separate and joint estimation in a
favorable case (probabilities of correct estimation). $KL_X$ and $KL_Y$ denote
$KL(\mathbb{Q}_X|\hat{\mathbb{Q}}_X)$ and $KL(\mathbb{Q}_Y|\hat{\mathbb{Q}}_Y)$, respectively.
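
For readers who wish to reproduce an experiment of this kind, samples from the two sources of this example can be drawn as in the following sketch. It is written under explicit assumptions that are not taken from the paper: the function name `simulate`, the fixed seeds, the arbitrary initial past and the burn-in period used to approximate stationarity are all illustrative choices.

```python
import random


def simulate(prob_of_one, n, seed=0, burn_in=100):
    """Draw n symbols from a {1,2}-valued context-tree source.

    prob_of_one maps each context (time-ordered string) to the probability
    that the next symbol is "1". The chain is started from an arbitrary past
    and the first burn_in symbols are discarded (an approximation of
    stationarity, assumed here for simplicity).
    """
    rng = random.Random(seed)
    depth = max(len(s) for s in prob_of_one)
    past = "1" * depth
    out = []
    for _ in range(n + burn_in):
        # The context is the unique suffix of the past that belongs to the tree.
        ctx = next(past[depth - k:] for k in range(1, depth + 1)
                   if past[depth - k:] in prob_of_one)
        sym = "1" if rng.random() < prob_of_one[ctx] else "2"
        out.append(sym)
        past = (past + sym)[-depth:]
    return "".join(out[burn_in:])


# Sources of the favorable example: contexts 12 and 22 are shared.
x = simulate({"1": 1 / 3, "12": 1 / 3, "22": 2 / 3}, n=500, seed=1)
y = simulate({"1": 3 / 4, "12": 1 / 3, "22": 2 / 3}, n=1000, seed=2)
print(len(x), len(y))   # 500 1000
```

Repeating such draws with different seeds and comparing the estimated and true context sets gives a Monte-Carlo estimate of the probabilities of correct identification.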

5.2 A less favorable example 

On the other hand, when $X$ and $Y$ share no (or few) contexts, then the
joint estimation procedure can only deteriorate the separate
estimates by introducing some confusion between similar, but distinct
conditional distributions of $X$ and $Y$. An example of such a case is the
following:

• $X$ and $Y$ are $\{1,2\}$-valued context-tree sources;

• $\mathbb{Q}_X$ is defined by the conditional distributions $\mathbb{Q}_X(X_0 = 1 | X_{-1} = 1) = 1/2$,
$\mathbb{Q}_X(X_0 = 1 | X_{-1} = 2) = 2/3$;

• $\mathbb{Q}_Y$ is defined by the conditional distributions $\mathbb{Q}_Y(Y_0 = 1 | Y_{-1} = 1) = 1/2$,
$\mathbb{Q}_Y(Y_0 = 1 | Y_{-2:-1} = 12) = 3/5$, $\mathbb{Q}_Y(Y_0 = 1 | Y_{-2:-1} = 22) = 3/4$;

• the estimates are computed from $X_{1:n}$ and $Y_{1:m}$ with $n = 1000$ and
$m = 1500$;

• the probability of correctly identifying the tree by each method is
estimated by a Monte-Carlo procedure with 1000 replications (margin
of error ≈ 1.5%).

In that example, $\sigma_0 = \{1\}$, $\sigma_1 = \{2\}$ and $\sigma_2 = \{12, 22\}$. The results are
summarized in Figure 2. In this case, $\mathbb{Q}_X$ and $\mathbb{Q}_Y$ are quite close, and
the joint estimation procedure tends to merge them into a single, common
distribution. Thus, the probability of correctly inferring the structure of
$\mathbb{Q}_X$ and $\mathbb{Q}_Y$ is significantly deteriorated.

Figure 2: Comparative performance of separate and joint estimation in the
unfavourable case (probabilities of correct estimation). $KL_X$ and $KL_Y$ denote
$KL(\mathbb{Q}_X|\hat{\mathbb{Q}}_X)$ and $KL(\mathbb{Q}_Y|\hat{\mathbb{Q}}_Y)$, respectively.



5.3 Influence of the penalty term 

A natural question is whether the performance of joint (or even separate)
estimation can be significantly improved by using other choices of
penalty functions, especially choices of the form $\mathrm{pen}(n) = \lambda\log(n)$ for
some positive $\lambda$. The BIC choice $\lambda = 1$ may be improved by using a
recent data-driven procedure called slope heuristic, see Birgé & Massart
(2007). However, in the present case, the attempts to tune the penalty
function by using the slope heuristic merely resulted in a confirmation
that the BIC choice could not be significantly improved on the examples
considered here. In fact, in addition to the difficulty of detecting the
dimension gap and thus the minimal penalty in our simulations (which could
be expected, as the number of models is very large whereas the samples
are not huge), the ideal penalty estimator was never observed to be very
different from $\lambda = 1$.



5.4 Discussion 

The simulation study strongly indicates that the joint estimation proce- 
dure has a significantly improved performance when the two sources do 
share contexts and conditional distributions which appear with a signif- 
icant probability in the samples. On the other hand, when the sources 
share no or few contexts, the procedure may introduce some confusion 
between the estimates, as could be expected. 

When the goal is joint estimation, deterioration in the estimation of 
one of the trees seems to be the price to pay for better estimating the 
other tree, and the net effect is positive. 

The predictive power of the estimated model is reflected by a measure
of discrepancy between the true law of the process and the law of the
estimated distribution. We chose to consider Kullback-Leibler divergence,
as it is naturally associated to logarithmic prediction loss in information
theory. As expected, a significant improvement is observed for the joint
estimator in presence of shared contexts.
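Appendix

A Technical Lemma

Let $\mathbb{P}_U$ denote the probability distribution of the memoryless source with
uniform marginal distribution on $A$. For a context tree $\tau$ and a string
$z_{1:k} \in A^k$ denote by $S_\tau(\emptyset; z_{1:k})$ the concatenation of the symbols that
are not in context $s$ for any $s \in \tau$, that is $S_\tau(\emptyset; z_{1:k}) = \bigodot_{i\in I(z_{1:k};\tau)} z_i$.
Then the Krichevsky-Trofimov (Krichevsky & Trofimov, 1981) probability
distribution is defined as

$$\begin{aligned}
\mathrm{KT}_{(\sigma_0,\sigma_1,\sigma_2)}(x_{1:n}; y_{1:m}) = {} & \mathbb{P}_U(S_{\sigma_0\cup\sigma_1}(\emptyset; x_{1:n}))\, \mathbb{P}_U(S_{\sigma_0\cup\sigma_2}(\emptyset; y_{1:m})) \\
& \times \prod_{s\in\sigma_0} \mathrm{KT}(S(s; x_{1:n}; y_{1:m})) \prod_{s\in\sigma_1} \mathrm{KT}(S(s; x_{1:n})) \prod_{s\in\sigma_2} \mathrm{KT}(S(s; y_{1:m})),
\end{aligned} \tag{9}$$

where

$$\mathrm{KT}(S(s; x_{1:n}; y_{1:m})) = \frac{\Gamma\left(\frac{|A|}{2}\right) \prod_{a\in A} \Gamma\left(N_{n,x}(s,a) + N_{m,y}(s,a) + \frac12\right)}{\Gamma\left(\frac12\right)^{|A|} \Gamma\left(N_{n,x}(s) + N_{m,y}(s) + \frac{|A|}{2}\right)},$$

$$\mathrm{KT}(S(s; x_{1:n})) = \frac{\Gamma\left(\frac{|A|}{2}\right) \prod_{a\in A} \Gamma\left(N_{n,x}(s,a) + \frac12\right)}{\Gamma\left(\frac12\right)^{|A|} \Gamma\left(N_{n,x}(s) + \frac{|A|}{2}\right)}, \qquad
\mathrm{KT}(S(s; y_{1:m})) = \frac{\Gamma\left(\frac{|A|}{2}\right) \prod_{a\in A} \Gamma\left(N_{m,y}(s,a) + \frac12\right)}{\Gamma\left(\frac12\right)^{|A|} \Gamma\left(N_{m,y}(s) + \frac{|A|}{2}\right)}.$$

Recall that for any tree $\sigma$, $D(\sigma)$ is its depth:

$$D(\sigma) = \max\{|s| : s \in \sigma\}.$$

Following Willems et al. (1995) (see also Gassiat (2010), and references
therein), Jensen's inequality leads to the following result:

Lemma 1 For any $x_{1:n}$ and any $y_{1:m}$,

$$\begin{aligned}
-\log \mathrm{KT}_{(\sigma_0,\sigma_1,\sigma_2)}(x_{1:n}; y_{1:m}) \leq {} & -\ell_{n,m}(\sigma_0,\sigma_1,\sigma_2) \\
& + \left[D(\sigma_0\cup\sigma_1) + D(\sigma_0\cup\sigma_2) + |\sigma_0| + |\sigma_1| + |\sigma_2|\right] \log|A| \\
& + \frac{|A|-1}{2}\left(|\sigma_0|\log\frac{n+m}{|\sigma_0|} + |\sigma_1|\log\frac{n}{|\sigma_1|} + |\sigma_2|\log\frac{m}{|\sigma_2|}\right).
\end{aligned}$$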

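B Proof of Theorem 1

The proof is divided into four parts.

1. We first prove that eventually almost surely, $|\hat\sigma_0| \leq k_n$ and $|\hat\sigma_1| \leq k_n$
and $|\hat\sigma_2| \leq k_n$ with

$$k_n = \frac{\log n}{\log\log\log n}.$$

For any $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2) and (3), define $B(\sigma_0,\sigma_1,\sigma_2)$ as
the set of $(x_{1:n}, y_{1:m})$ in $A^{n+m}$ such that

$$(X_{1:n}, Y_{1:m}) = (x_{1:n}, y_{1:m}) \implies (\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) = (\sigma_0,\sigma_1,\sigma_2),$$

so that

$$\mathbb{Q}^*\left((\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) = (\sigma_0,\sigma_1,\sigma_2)\right) = \sum_{(x_{1:n},y_{1:m})\in B(\sigma_0,\sigma_1,\sigma_2)} \mathbb{Q}^*\left((X_{1:n}, Y_{1:m}) = (x_{1:n}, y_{1:m})\right).$$

If $(x_{1:n}, y_{1:m}) \in B(\sigma_0,\sigma_1,\sigma_2)$, then

$$\begin{aligned}
& \ell_{n,m}(\sigma_0,\sigma_1,\sigma_2) - \frac{|A|-1}{2}\left(|\sigma_0|\mathrm{pen}(n+m) + |\sigma_1|\mathrm{pen}(n) + |\sigma_2|\mathrm{pen}(m)\right) \\
& \quad \geq \ell_{n,m}(\sigma_0^*,\sigma_1^*,\sigma_2^*) - \frac{|A|-1}{2}\left(|\sigma_0^*|\mathrm{pen}(n+m) + |\sigma_1^*|\mathrm{pen}(n) + |\sigma_2^*|\mathrm{pen}(m)\right),
\end{aligned}$$

and using Lemma 1, if $(x_{1:n}, y_{1:m}) \in B(\sigma_0,\sigma_1,\sigma_2)$, then

$$\begin{aligned}
\mathbb{Q}^*\left((X_{1:n}, Y_{1:m}) = (x_{1:n}, y_{1:m})\right) & \leq 2^{\ell_{n,m}(\sigma_0,\sigma_1,\sigma_2) + \frac{|A|-1}{2}\left((|\sigma_0^*|-t_0)\mathrm{pen}(n+m) + (|\sigma_1^*|-t_1)\mathrm{pen}(n) + (|\sigma_2^*|-t_2)\mathrm{pen}(m)\right)} \\
& \leq \mathrm{KT}_{(\sigma_0,\sigma_1,\sigma_2)}(x_{1:n}; y_{1:m})\, 2^{H(n,m,t_0,t_1,t_2)}
\end{aligned}$$

with $t_i = |\sigma_i|$, $i = 0, 1, 2$, and

$$\begin{aligned}
H(n,m,t_0,t_1,t_2) = {} & \frac{|A|-1}{2}\left[t_0\log\frac{n+m}{t_0} + t_1\log\frac{n}{t_1} + t_2\log\frac{m}{t_2}\right] \\
& + \frac{|A|-1}{2}\left((|\sigma_0^*|-t_0)\mathrm{pen}(n+m) + (|\sigma_1^*|-t_1)\mathrm{pen}(n) + (|\sigma_2^*|-t_2)\mathrm{pen}(m)\right) \\
& + [3t_0 + 2t_1 + 2t_2]\log|A| \\
= {} & \frac{|A|-1}{2}\left\{-t_0\log t_0 - t_1\log t_1 - t_2\log t_2 + |\sigma_0^*|\log(n+m) + |\sigma_1^*|\log(n) + |\sigma_2^*|\log(m)\right\} \\
& + [3t_0 + 2t_1 + 2t_2]\log|A|,
\end{aligned}$$

using $\mathrm{pen}(\cdot) = \log(\cdot)$ and using that for a complete tree $\sigma$, $D(\sigma) \leq |\sigma|$.

Thus,

$$\mathbb{Q}^*\left((\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) = (\sigma_0,\sigma_1,\sigma_2)\right) \leq 2^{H(n,m,t_0,t_1,t_2)},$$

and

$$\begin{aligned}
\mathbb{Q}^*\left(|\hat\sigma_0| > k_n \text{ or } |\hat\sigma_1| > k_n \text{ or } |\hat\sigma_2| > k_n\right) \leq {} & \sum_{t_0=k_n+1}^{n\vee m}\ \sum_{t_1,t_2=0}^{n\vee m} F(t_0,t_1,t_2)\, 2^{H(n,m,t_0,t_1,t_2)} \\
& + \sum_{t_1=k_n+1}^{n\vee m}\ \sum_{t_0,t_2=0}^{n\vee m} F(t_0,t_1,t_2)\, 2^{H(n,m,t_0,t_1,t_2)} \\
& + \sum_{t_2=k_n+1}^{n\vee m}\ \sum_{t_0,t_1=0}^{n\vee m} F(t_0,t_1,t_2)\, 2^{H(n,m,t_0,t_1,t_2)}
\end{aligned}$$

where $F(t_0,t_1,t_2)$ is the number of $(\sigma_0,\sigma_1,\sigma_2)$ satisfying (1), (2)
and (3) and such that $|\sigma_0| = t_0$, $|\sigma_1| = t_1$, and $|\sigma_2| = t_2$.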
But the number of complete trees with $t$ elements is upper bounded
by $16^t$, see Garivier (2006), so that, denoting by $\binom{t}{k} \leq 2^t$ the
binomial coefficient, one has

$$F(t_0,t_1,t_2) \leq \binom{t_0+t_1}{t_0} 16^{t_0+t_1} \binom{t_0+t_2}{t_0} 16^{t_0+t_2} \leq 16^{4t_0+2t_1+2t_2}.$$

Using the fact that for any constant $a$, $-t\log t + at$ is bounded on
$\mathbb{R}^+$, and using (6) one gets that for some constants $C_1$, $C_2$ and $C_3$,

$$\mathbb{Q}^*\left(|\hat\sigma_0| > k_n \text{ or } |\hat\sigma_1| > k_n \text{ or } |\hat\sigma_2| > k_n\right) \leq C_1 2^{-C_2 k_n\log k_n + C_3\log n}.$$

But

$$\lim \frac{k_n\log k_n}{\log n} = +\infty$$

so that one gets that for another constant $C$,

$$\mathbb{Q}^*\left(|\hat\sigma_0| > k_n \text{ or } |\hat\sigma_1| > k_n \text{ or } |\hat\sigma_2| > k_n\right) \leq \frac{C}{n^2},$$

and using Borel-Cantelli's Lemma, we obtain that $\mathbb{Q}^*$-eventually
almost surely, $|\hat\sigma_0| \leq k_n$ and $|\hat\sigma_1| \leq k_n$ and $|\hat\sigma_2| \leq k_n$.

2. We prove that $\mathbb{Q}^*$-eventually almost surely, no context is
overestimated.

It is sufficient to prove that, $\mathbb{Q}^*$-almost surely, if $(\sigma_0,\sigma_1,\sigma_2)$
satisfy (1), (2) and (3) and are such that for some $i$, $\sigma_i$ contains some
string that has a proper suffix in $\sigma_i^*$, there exists $(\bar\sigma_0,\bar\sigma_1,\bar\sigma_2)$
satisfying (1), (2) and (3) and such that, eventually, $C_{n,m}(\bar\sigma_0,\bar\sigma_1,\bar\sigma_2) >
C_{n,m}(\sigma_0,\sigma_1,\sigma_2)$, so that $(\hat\sigma_0,\hat\sigma_1,\hat\sigma_2) \neq (\sigma_0,\sigma_1,\sigma_2)$.
Consider first the case where $\sigma_0^*$ is overestimated. Let $(\sigma_0,\sigma_1,\sigma_2)$
satisfy (1), (2) and (3) and be such that $\sigma_0$ contains some string
that has a proper suffix in $\sigma_0^*$. Let $s = av$, $a \in A$, be the longest
such string, and let $u \in \sigma_0^*$ be the corresponding suffix of $v$. For
$i \in \{0, 1, 2\}$, let $S_i = A^+v \cap \sigma_i$ and define

$$\bar\sigma_0 = (\sigma_0\setminus S_0) \cup \{v\}, \quad \bar\sigma_1 = \sigma_1\setminus S_1, \quad \bar\sigma_2 = \sigma_2\setminus S_2.$$
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Then 

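\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = \sum_{b \in A} \big[ N_{n,X}(v,b) + N_{m,Y}(v,b) \big] \log \frac{N_{n,X}(v,b) + N_{m,Y}(v,b)}{N_{n,X}(v) + N_{m,Y}(v)} - \frac{|A|-1}{2} \log (n+m) \\
& \qquad - \sum_{w \in S_0} \Big\{ \sum_{b \in A} \big[ N_{n,X}(w,b) + N_{m,Y}(w,b) \big] \log \frac{N_{n,X}(w,b) + N_{m,Y}(w,b)}{N_{n,X}(w) + N_{m,Y}(w)} - \frac{|A|-1}{2} \log (n+m) \Big\} \\
& \qquad - \sum_{w \in S_1} \Big\{ \sum_{b \in A} N_{n,X}(w,b) \log \frac{N_{n,X}(w,b)}{N_{n,X}(w)} - \frac{|A|-1}{2} \log (n) \Big\} \\
& \qquad - \sum_{w \in S_2} \Big\{ \sum_{b \in A} N_{m,Y}(w,b) \log \frac{N_{m,Y}(w,b)}{N_{m,Y}(w)} - \frac{|A|-1}{2} \log (m) \Big\}.
\end{aligned}
\]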

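By definition of the maximum likelihood, the above expression is lower-bounded by:

\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad \ge \sum_{b \in A} \big[ N_{n,X}(v,b) + N_{m,Y}(v,b) \big] \log Q_X^*(b|v) - \frac{|A|-1}{2} \log (n+m) \\
& \qquad - \sum_{w \in S_0} \Big\{ \sum_{b \in A} \big[ N_{n,X}(w,b) + N_{m,Y}(w,b) \big] \log \big( \hat{Q}_{XY}(b|w) \big) - \frac{|A|-1}{2} \log (n+m) \Big\} \\
& \qquad - \sum_{w \in S_1} \Big\{ \sum_{b \in A} N_{n,X}(w,b) \log \big( \hat{Q}_X(b|w) \big) - \frac{|A|-1}{2} \log (n) \Big\} \\
& \qquad - \sum_{w \in S_2} \Big\{ \sum_{b \in A} N_{m,Y}(w,b) \log \big( \hat{Q}_Y(b|w) \big) - \frac{|A|-1}{2} \log (m) \Big\}.
\end{aligned}
\]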

Notice that

\[
Q_X^*(\cdot|v) = Q_Y^*(\cdot|v) = Q_X^*(\cdot|w)
\]

for any $w \in S_0 \cup S_1 \cup S_2$.

It follows from part 1 of the proof that we only need to consider trees $\sigma_i$ such that $|\sigma_i| = o(\log n)$. Notice also that since $D(\sigma_i) = o(\log n)$, for any $b \in A$,

\[
N_{n,X}(v,b) = \sum_{w \in S_0 \cup S_1} N_{n,X}(w,b) + o(\log n), \qquad
N_{m,Y}(v,b) = \sum_{w \in S_0 \cup S_2} N_{m,Y}(w,b) + o(\log n).
\]

Let $\mathrm{KL}(q_1|q_2) = \sum_{a \in A} q_1(a) \log \frac{q_1(a)}{q_2(a)}$ denote the Kullback-Leibler divergence between two probability measures $q_1$ and $q_2$ on $A$, with the convention that $0 \log(0/x) = 0$ for $x \ge 0$ and $x \log(x/0) = +\infty$ for $x > 0$. Since the minimum of all positive transition probabilities in $\mathbb{Q}^*$ is positive, one gets



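\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad \ge \sum_{w \in S_0} \sum_{b \in A} \big[ N_{n,X}(w,b) + N_{m,Y}(w,b) \big] \log \frac{Q_X^*(b|w)}{\hat{Q}_{XY}(b|w)} + (|S_0| - 1) \frac{|A|-1}{2} \log (n+m) \\
& \qquad + \sum_{w \in S_1} \sum_{b \in A} N_{n,X}(w,b) \log \frac{Q_X^*(b|w)}{\hat{Q}_X(b|w)} + |S_1| \frac{|A|-1}{2} \log (n) \\
& \qquad + \sum_{w \in S_2} \sum_{b \in A} N_{m,Y}(w,b) \log \frac{Q_Y^*(b|w)}{\hat{Q}_Y(b|w)} + |S_2| \frac{|A|-1}{2} \log (m) + o(\log n) \\
& \quad = - \sum_{w \in S_0} \big[ N_{n,X}(w) + N_{m,Y}(w) \big] \mathrm{KL}\big( \hat{Q}_{XY}(\cdot|w) \,\big|\, Q_X^*(\cdot|w) \big) + (|S_0| - 1) \frac{|A|-1}{2} \log (n+m) \\
& \qquad - \sum_{w \in S_1} N_{n,X}(w)\, \mathrm{KL}\big( \hat{Q}_X(\cdot|w) \,\big|\, Q_X^*(\cdot|w) \big) + |S_1| \frac{|A|-1}{2} \log (n) \\
& \qquad - \sum_{w \in S_2} N_{m,Y}(w)\, \mathrm{KL}\big( \hat{Q}_Y(\cdot|w) \,\big|\, Q_Y^*(\cdot|w) \big) + |S_2| \frac{|A|-1}{2} \log (m) + o(\log n).
\end{aligned}
\]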



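According to the typicality Lemma 6.2 of Csiszár & Talata (2006), for all $\delta > 0$, for all $w$ such that $N_{n,X}(w) \ge 1$ and for all $b \in A$ it holds that, $\mathbb{Q}^*$-eventually almost surely,

\[
\big| \hat{Q}_X(b|w) - Q_X^*(b|w) \big| < \sqrt{\frac{\delta \log n}{N_{n,X}(w)}}.
\]

Besides, Lemma 6.3 of Csiszár & Talata (2006) states that

\[
\mathrm{KL}\big( \hat{Q}_X(\cdot|w) \,\big|\, Q_X^*(\cdot|w) \big) \le \sum_{b \in A} \frac{\big( \hat{Q}_X(b|w) - Q_X^*(b|w) \big)^2}{Q_X^*(b|w)}.
\]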
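
The chi-square type bound on the Kullback-Leibler divergence can be illustrated numerically. The following is a minimal sketch, assuming natural logarithms (with another base the divergence is rescaled by a constant) and using the conventions for zero probabilities stated above; the helper names `kl`, `chi2` and `random_dist` are hypothetical and introduced only for this illustration.

```python
import math
import random


def kl(p, q):
    """Kullback-Leibler divergence in nats, with 0 log(0/x) = 0 and x log(x/0) = +inf."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa == 0.0:
            continue  # convention: 0 log(0/x) = 0
        if qa == 0.0:
            return math.inf  # convention: x log(x/0) = +inf for x > 0
        total += pa * math.log(pa / qa)
    return total


def chi2(p, q):
    """Chi-square type quantity sum_a (p(a) - q(a))^2 / q(a); requires q(a) > 0."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))


def random_dist(k, rng):
    """Random probability vector of length k with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(weights)
    return [x / s for x in weights]


rng = random.Random(0)
pairs = [(random_dist(4, rng), random_dist(4, rng)) for _ in range(1000)]
# Small tolerance guards against floating-point rounding only.
bound_holds = all(kl(p, q) <= chi2(p, q) + 1e-12 for p, q in pairs)
print(bound_holds)
```

In the proof, `p` plays the role of the empirical conditional distribution and `q` that of the true one, so the typicality bound on each coordinate difference turns directly into a bound on the divergence.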



Handling similarly the terms involving $\hat{Q}_Y$ and $\hat{Q}_{XY}$, and denoting by $Q_{\min} > 0$ the minimum of all positive transition probabilities in $\mathbb{Q}^*$, we obtain that for any $\delta > 0$, $\mathbb{Q}^*$-eventually almost surely for all possible $(\sigma_0, \sigma_1, \sigma_2)$:

\[
\begin{aligned}
C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \ge {}
& - \frac{\delta |A|}{Q_{\min}} |S_0| \log (n+m) + (|S_0| - 1) \frac{|A|-1}{2} \log (n+m) \\
& - \frac{\delta |A|}{Q_{\min}} |S_1| \log (n) + |S_1| \frac{|A|-1}{2} \log (n) \\
& - \frac{\delta |A|}{Q_{\min}} |S_2| \log (m) + |S_2| \frac{|A|-1}{2} \log (m)
\end{aligned}
\]

which is positive, for all possible $(\sigma_0, \sigma_1, \sigma_2)$, $\mathbb{Q}^*$-eventually almost surely. This follows from the fact that we consider complete context trees, and therefore $|S_0| \ge 1$, $|S_0| + |S_1| \ge |A|$ and $|S_0| + |S_2| \ge |A|$.
Consider now the case where $\sigma_i^*$, $i = 1$ or $i = 2$, is overestimated. Let $(\sigma_0, \sigma_1, \sigma_2)$ satisfy (1), (2) and (3) and be such that $\sigma_i$ contains some string that has a proper suffix in $\sigma_i^*$. Let $s = av$, $a \in A$, be the longest such string, and let $u \in \sigma_i^*$ be the corresponding suffix of $v$. For $j = 0, 1, 2$, let again $S_j = A^+ v \cap \sigma_j$. Then, either $S_0 = \emptyset$, and the problem boils down to the overestimation of a single tree: the consistency result of Csiszár & Talata (2006) applies and shows that denoting

\[
\bar\sigma_i = (\sigma_i \setminus S_i) \cup \{v\}, \qquad \bar\sigma_j = \sigma_j, \; j \ne i,
\]

we have $C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) > C_{n,m}(\sigma_0, \sigma_1, \sigma_2)$ $\mathbb{Q}^*$-eventually almost surely. Or $\sigma_0^*$ has also been overestimated, so that one may apply the previous proof.

3. Consider now the underestimation case. If $\sigma_0^*$ has been underestimated, there exists $s \in \sigma_0$ which is a proper suffix of some $s_0 \in \sigma_0^*$. For $i = 0, 1, 2$, let $S_i = A^+ s \cap \sigma_i^*$, and define

\[
\bar\sigma_0 = (\sigma_0 \setminus \{s\}) \cup S_0, \qquad \bar\sigma_1 = \sigma_1 \cup S_1, \qquad \bar\sigma_2 = \sigma_2 \cup S_2.
\]

Then

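\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = \sum_{w \in S_0} \Big\{ \sum_{b \in A} \big[ N_{n,X}(w,b) + N_{m,Y}(w,b) \big] \log \frac{N_{n,X}(w,b) + N_{m,Y}(w,b)}{N_{n,X}(w) + N_{m,Y}(w)} - \frac{|A|-1}{2} \log (n+m) \Big\} \\
& \qquad + \sum_{w \in S_1} \Big\{ \sum_{b \in A} N_{n,X}(w,b) \log \frac{N_{n,X}(w,b)}{N_{n,X}(w)} - \frac{|A|-1}{2} \log (n) \Big\} \\
& \qquad + \sum_{w \in S_2} \Big\{ \sum_{b \in A} N_{m,Y}(w,b) \log \frac{N_{m,Y}(w,b)}{N_{m,Y}(w)} - \frac{|A|-1}{2} \log (m) \Big\} \\
& \qquad - \sum_{b \in A} \big[ N_{n,X}(s,b) + N_{m,Y}(s,b) \big] \log \frac{N_{n,X}(s,b) + N_{m,Y}(s,b)}{N_{n,X}(s) + N_{m,Y}(s)} + \frac{|A|-1}{2} \log (n+m).
\end{aligned}
\]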

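Notice that for any string $u$, for any $b \in A$, $\frac{1}{n} N_{n,X}(u,b)$ and $\frac{1}{n} N_{n,X}(u)$ converge $\mathbb{Q}^*$ almost surely to $Q_X^*(ub)$ and $Q_X^*(u)$ respectively, and $\frac{1}{n} N_{m,Y}(u,b)$ and $\frac{1}{n} N_{m,Y}(u)$ converge $\mathbb{Q}^*$ almost surely to $\frac{1}{c} Q_Y^*(ub)$ and $\frac{1}{c} Q_Y^*(u)$ respectively.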
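Thus, $\mathbb{Q}^*$ almost surely,

\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = -O(\log n) + n \sum_{w \in S_0} \sum_{b \in A} \Big[ Q_X^*(wb) + \frac{1}{c} Q_Y^*(wb) \Big] \log \frac{Q_X^*(wb) + \frac{1}{c} Q_Y^*(wb)}{Q_X^*(w) + \frac{1}{c} Q_Y^*(w)} \\
& \qquad + n \sum_{w \in S_1} \sum_{b \in A} Q_X^*(wb) \log \frac{Q_X^*(wb)}{Q_X^*(w)} + n \sum_{w \in S_2} \sum_{b \in A} \frac{1}{c} Q_Y^*(wb) \log \frac{Q_Y^*(wb)}{Q_Y^*(w)} \\
& \qquad - n \sum_{b \in A} \Big[ Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb) \Big] \log \frac{Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb)}{Q_X^*(s) + \frac{1}{c} Q_Y^*(s)} + o(n) \\
& \quad = -O(\log n) + o(n) + n \sum_{w \in S_0 \cup S_1} \sum_{b \in A} Q_X^*(wb) \log \frac{Q_X^*(wb)}{Q_X^*(w)} + n \sum_{w \in S_0 \cup S_2} \sum_{b \in A} \frac{1}{c} Q_Y^*(wb) \log \frac{Q_Y^*(wb)}{Q_Y^*(w)} \\
& \qquad - n \sum_{b \in A} \Big[ Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb) \Big] \log \frac{Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb)}{Q_X^*(s) + \frac{1}{c} Q_Y^*(s)}
\end{aligned}
\]

because for $w \in S_0$, $Q_X^*(b|w) = Q_Y^*(b|w)$ for all $b \in A$. Since $Q_X^*(s) = \sum_{w \in S_0 \cup S_1} Q_X^*(w)$, for any $b \in A$ Jensen's inequality implies that

\[
\sum_{w \in S_0 \cup S_1} Q_X^*(wb) \log \frac{Q_X^*(wb)}{Q_X^*(w)} \ge Q_X^*(sb) \log \frac{Q_X^*(sb)}{Q_X^*(s)}
\]

and the inequality is strict for at least one $b \in A$, for otherwise, $s$ would be a context for $Q_X^*$. Similarly for any $b \in A$,

\[
\sum_{w \in S_0 \cup S_2} Q_Y^*(wb) \log \frac{Q_Y^*(wb)}{Q_Y^*(w)} \ge Q_Y^*(sb) \log \frac{Q_Y^*(sb)}{Q_Y^*(s)}.
\]

Using the concavity of the entropy function,

\[
\begin{aligned}
& \sum_{b \in A} Q_X^*(sb) \log \frac{Q_X^*(sb)}{Q_X^*(s)} + \frac{1}{c} \sum_{b \in A} Q_Y^*(sb) \log \frac{Q_Y^*(sb)}{Q_Y^*(s)} \\
& \quad \ge \sum_{b \in A} \Big[ Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb) \Big] \log \frac{Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb)}{Q_X^*(s) + \frac{1}{c} Q_Y^*(s)}
\end{aligned}
\]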

so that there exists $\delta > 0$ such that

\[
C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \ge n \delta
\]

$\mathbb{Q}^*$-eventually almost surely.
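
Both the Jensen step and the concavity step of this underestimation argument are instances of the log-sum inequality, $\sum_i a_i \log (a_i / b_i) \ge (\sum_i a_i) \log (\sum_i a_i / \sum_i b_i)$ for positive $a_i$, $b_i$, with equality when the ratios $a_i / b_i$ are all equal. The sketch below checks this numerically; the function name `log_sum_gap` and the random test vectors are assumptions made for the illustration.

```python
import math
import random


def log_sum_gap(a, b):
    """Return sum_i a_i log(a_i/b_i) - (sum a) log(sum a / sum b) for positive a_i, b_i."""
    lhs = sum(x * math.log(x / y) for x, y in zip(a, b))
    total_a, total_b = sum(a), sum(b)
    rhs = total_a * math.log(total_a / total_b)
    return lhs - rhs


rng = random.Random(1)
gaps = []
for _ in range(1000):
    a = [rng.random() + 0.01 for _ in range(5)]
    b = [rng.random() + 0.01 for _ in range(5)]
    gaps.append(log_sum_gap(a, b))

# The gap is nonnegative up to floating-point rounding.
print(min(gaps) >= -1e-12)

# Equality case: a is proportional to b, so every ratio a_i / b_i is the same.
print(abs(log_sum_gap([1.0, 2.0], [2.0, 4.0])) < 1e-12)
```

In the proof, the equality case corresponds to the conditional distributions coinciding, which is exactly what is excluded when $s$ is not a context.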

If $\sigma_i^*$, $i = 1$ or $i = 2$, has been underestimated, then the problem boils down to the standard underestimation of a single context tree. Defining (with obvious notation)

\[
\bar\sigma_i = (\sigma_i \setminus \{s\}) \cup S_i \cup S_0, \qquad \bar\sigma_j = \sigma_j, \; j \ne i,
\]

it is proved in Csiszár & Talata (2006), Section III, that $\mathbb{Q}^*$-eventually almost surely, $C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) > C_{n,m}(\sigma_0, \sigma_1, \sigma_2)$.

4. We have thus proved that, for $i = 1$ and $i = 2$, $\hat\sigma_0 \cup \hat\sigma_i = \sigma_0^* \cup \sigma_i^*$, $\mathbb{Q}^*$-eventually almost surely. Let $(\sigma_0, \sigma_1, \sigma_2)$ satisfy (1), (2) and (3) and be such that, for $i = 1$ and $i = 2$, $\sigma_0 \cup \sigma_i = \sigma_0^* \cup \sigma_i^*$. It remains to check that $\mathbb{Q}^*$ almost surely, if there exists a string $s$ such that

• $s \in \sigma_0$, but $s \in \sigma_1^*$ and $s \in \sigma_2^*$,

• or $s \in \sigma_1$ and $s \in \sigma_2$, but $s \in \sigma_0^*$,

then $(\hat\sigma_0, \hat\sigma_1, \hat\sigma_2) \ne (\sigma_0, \sigma_1, \sigma_2)$ eventually.

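Consider first the case where $s \in \sigma_0$, but $s \in \sigma_1^*$ and $s \in \sigma_2^*$. Define

\[
\bar\sigma_0 = \sigma_0 \setminus \{s\}, \qquad \bar\sigma_1 = \sigma_1 \cup \{s\}, \qquad \bar\sigma_2 = \sigma_2 \cup \{s\}.
\]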

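Then

\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = \sum_{b \in A} N_{n,X}(s,b) \log \frac{N_{n,X}(s,b)}{N_{n,X}(s)} + \sum_{b \in A} N_{m,Y}(s,b) \log \frac{N_{m,Y}(s,b)}{N_{m,Y}(s)} \\
& \qquad - \sum_{b \in A} \big[ N_{n,X}(s,b) + N_{m,Y}(s,b) \big] \log \frac{N_{n,X}(s,b) + N_{m,Y}(s,b)}{N_{n,X}(s) + N_{m,Y}(s)} + \frac{|A|-1}{2} \big\{ \log (n+m) - \log n - \log m \big\} \\
& \quad = n \Big[ \sum_{b \in A} Q_X^*(sb) \log \frac{Q_X^*(sb)}{Q_X^*(s)} + \frac{1}{c} \sum_{b \in A} Q_Y^*(sb) \log \frac{Q_Y^*(sb)}{Q_Y^*(s)} \\
& \qquad\quad - \sum_{b \in A} \Big( Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb) \Big) \log \frac{Q_X^*(sb) + \frac{1}{c} Q_Y^*(sb)}{Q_X^*(s) + \frac{1}{c} Q_Y^*(s)} \Big] + o(n) - O(\log n)
\end{aligned}
\]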

$\mathbb{Q}^*$ almost surely. But the quantity in brackets is positive by the strict concavity of the entropy function, unless for any $b \in A$, $Q_X^*(b|s) = Q_Y^*(b|s)$, which would mean that $s \in \sigma_0^*$.
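Consider now the case where $s \in \sigma_1$ and $s \in \sigma_2$, but $s \in \sigma_0^*$. Define

\[
\bar\sigma_0 = \sigma_0 \cup \{s\}, \qquad \bar\sigma_1 = \sigma_1 \setminus \{s\}, \qquad \bar\sigma_2 = \sigma_2 \setminus \{s\}.
\]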



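Then

\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = \sum_{b \in A} \big[ N_{n,X}(s,b) + N_{m,Y}(s,b) \big] \log \frac{N_{n,X}(s,b) + N_{m,Y}(s,b)}{N_{n,X}(s) + N_{m,Y}(s)} \\
& \qquad - \sum_{b \in A} N_{n,X}(s,b) \log \frac{N_{n,X}(s,b)}{N_{n,X}(s)} - \sum_{b \in A} N_{m,Y}(s,b) \log \frac{N_{m,Y}(s,b)}{N_{m,Y}(s)} \\
& \qquad + \frac{|A|-1}{2} \big\{ \log n + \log m - \log (n+m) \big\}.
\end{aligned}
\]

Using a second-order Taylor expansion of $u \log u$, one gets

\[
\begin{aligned}
& C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) \\
& \quad = \Big[ \frac{1}{2} \sum_{b \in A} \frac{\big( [N_{n,X}(s,b) + N_{m,Y}(s,b)] - [N_{n,X}(s) + N_{m,Y}(s)]\, Q_X^*(b|s) \big)^2}{[N_{n,X}(s) + N_{m,Y}(s)]\, Q_X^*(b|s)} \\
& \qquad\quad - \frac{1}{2} \sum_{b \in A} \frac{\big( N_{n,X}(s,b) - N_{n,X}(s)\, Q_X^*(b|s) \big)^2}{N_{n,X}(s)\, Q_X^*(b|s)}
 - \frac{1}{2} \sum_{b \in A} \frac{\big( N_{m,Y}(s,b) - N_{m,Y}(s)\, Q_Y^*(b|s) \big)^2}{N_{m,Y}(s)\, Q_Y^*(b|s)} \Big] (1 + o(1)) \\
& \qquad + \frac{|A|-1}{2} \big\{ \log n + \log m - \log (n+m) \big\}.
\end{aligned}
\]

The sequences

\[
\big( N_{n,X}(s,b) - N_{n,X}(s)\, Q_X^*(b|s) \big)_{n \ge 0}, \qquad
\big( N_{m,Y}(s,b) - N_{m,Y}(s)\, Q_Y^*(b|s) \big)_{m \ge 0},
\]

are martingales with respect to the natural filtration. Thus, it follows from the law of the iterated logarithm for martingales, see Neveu (1972), that, $\mathbb{Q}^*$ almost surely,

\[
C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) = O(\log \log n) + \frac{|A|-1}{2} \big\{ \log n + \log m - \log (n+m) \big\},
\]

so that $\mathbb{Q}^*$ almost surely,

\[
C_{n,m}(\bar\sigma_0, \bar\sigma_1, \bar\sigma_2) - C_{n,m}(\sigma_0, \sigma_1, \sigma_2) > 0
\]

eventually. This ends the proof of Theorem 1.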
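
The second-order Taylor step used in the last case can be illustrated on a single context: for counts close to their expectations, $\sum_b N(s,b) \log \frac{N(s,b)}{N(s)\, q(b)}$ is close to $\frac{1}{2} \sum_b \frac{(N(s,b) - N(s)\, q(b))^2}{N(s)\, q(b)}$. The sketch below assumes natural logarithms; the counts, the distribution `q` and the function names are made-up values chosen only for the illustration.

```python
import math


def log_likelihood_ratio(counts, q):
    """sum_b N_b log(N_b / (N q_b)), where N is the total count."""
    n = sum(counts)
    return sum(c * math.log(c / (n * qb)) for c, qb in zip(counts, q) if c > 0)


def half_chi_square(counts, q):
    """(1/2) sum_b (N_b - N q_b)^2 / (N q_b): the second-order Taylor term."""
    n = sum(counts)
    return 0.5 * sum((c - n * qb) ** 2 / (n * qb) for c, qb in zip(counts, q))


# Hypothetical transition distribution and counts with small deviations from N q_b.
q = [0.2, 0.3, 0.5]
counts = [2030, 2950, 5020]  # N = 10000, expected counts 2000, 3000, 5000

exact = log_likelihood_ratio(counts, q)
approx = half_chi_square(counts, q)
relative_error = abs(exact - approx) / approx
print(relative_error < 0.01)
```

The remaining discrepancy is the third-order term of the expansion, which is what the factor $(1 + o(1))$ absorbs.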